split: preserve non-UTF-8 bytes in output filename generation #11397

Merged
sylvestre merged 2 commits into uutils:main from can1357:split-preserve-non-utf8-bytes-in-output-filename-generation
Apr 3, 2026

can1357 (Contributor) commented Mar 18, 2026

uutils split accepts non-UTF-8 prefix and suffix inputs but converts them with to_string_lossy() when building chunk filenames. GNU keeps pathname bytes intact, while uutils rewrites invalid bytes to UTF-8 replacement characters.
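The difference between the two behaviours can be illustrated with a minimal sketch. This is not the code from the PR: `lossy_name` and `preserved_name` are hypothetical helpers written for illustration, and the example assumes a Unix target, where `OsStr` can be built from raw bytes.

```rust
use std::ffi::{OsStr, OsString};
use std::os::unix::ffi::OsStrExt; // Unix-only: raw byte access to OsStr

/// Hypothetical helper: builds a chunk name through a lossy UTF-8 conversion.
/// Every invalid byte sequence becomes U+FFFD (encoded as EF BF BD).
fn lossy_name(prefix: &OsStr, suffix: &str) -> String {
    format!("{}{}", prefix.to_string_lossy(), suffix)
}

/// Hypothetical helper: appends the suffix to the prefix without ever
/// leaving the OsString domain, so the original bytes are kept as-is.
fn preserved_name(prefix: &OsStr, suffix: &str) -> OsString {
    let mut name = prefix.to_os_string();
    name.push(suffix);
    name
}

fn main() {
    // Same prefix as in the reproduction steps: 'p' followed by byte 0xFF.
    let prefix = OsStr::from_bytes(b"p\xff");

    let lossy = lossy_name(prefix, "aa");
    let kept = preserved_name(prefix, "aa");

    // Print both names as hex bytes for comparison with the `od` output.
    println!("lossy:     {:02x?}", lossy.as_bytes());
    println!("preserved: {:02x?}", kept.as_bytes());
}
```

Staying in `OsString` (or a raw byte buffer) for the whole filename construction is one way to avoid the rewrite; the approach actually taken in the PR may differ.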

Reproduction Steps

d=$(mktemp -d); p=$(printf "p\377"); printf "AB" | split -b1 - "$d/$p"; ls "$d" | od -An -tx1
# Expected (GNU): 70 ff 61 61 0a 70 ff 61 62 0a
# Actual (uutils): 70 ef bf bd 61 61 0a 70 ef bf bd 61 62 0a

Impact

Chunk files are created under rewritten names instead of the requested byte paths. This breaks GNU compatibility in non-UTF-8 environments and can cause filename collisions or misdirected output files.
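The collision risk follows from the lossy conversion mapping every invalid byte to the same replacement character. A small sketch, assuming a Unix target and using a hypothetical helper `lossy_collide` that is not part of uutils:

```rust
use std::ffi::OsStr;
use std::os::unix::ffi::OsStrExt; // Unix-only: build OsStr from raw bytes

/// Hypothetical helper: reports whether two byte paths become the same
/// string once invalid UTF-8 is replaced with U+FFFD.
fn lossy_collide(a: &[u8], b: &[u8]) -> bool {
    OsStr::from_bytes(a).to_string_lossy() == OsStr::from_bytes(b).to_string_lossy()
}

fn main() {
    // Two distinct prefixes that differ only in an invalid byte...
    let first: &[u8] = b"p\xff";
    let second: &[u8] = b"p\xfe";
    assert_ne!(first, second);

    // ...end up with the same lossy name, so their chunks would share files.
    assert!(lossy_collide(first, second));

    // Valid UTF-8 prefixes are unaffected by the conversion.
    assert!(!lossy_collide(b"pa", b"pb"));
    println!("distinct byte prefixes collide after lossy conversion");
}
```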

codspeed-hq Bot commented Mar 19, 2026

Merging this PR will improve performance by 7.35%

⚡ 1 improved benchmark
✅ 296 untouched benchmarks
⏩ 48 skipped benchmarks (see footnote 1)

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|
| Simulation | split_bytes | 431.6 µs | 402 µs | +7.35% |

Comparing can1357:split-preserve-non-utf8-bytes-in-output-filename-generation (af4bd49) with main (209bb97)

Footnotes

  1. 48 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, archive them in CodSpeed to remove them from the performance reports.

sylvestre merged commit d2b9550 into uutils:main Apr 3, 2026
163 checks passed
sylvestre added a commit to sylvestre/coreutils-1 that referenced this pull request Apr 3, 2026
* tests/split/non-utf8.sh: New test to ensure that non-UTF-8 bytes
in the prefix and --additional-suffix are preserved as-is in output
filenames, rather than being replaced by UTF-8 replacement characters.
* tests/local.mk: Register new test.
uutils/coreutils#11397
hubot pushed a commit to coreutils/coreutils that referenced this pull request Apr 6, 2026
* tests/split/non-utf8.sh: New test to ensure that non-UTF-8 bytes
in the prefix and --additional-suffix are preserved as-is in output
filenames, rather than being replaced by UTF-8 replacement characters.
* tests/local.mk: Register new test.
uutils/coreutils#11397
#239
kevinburkesegment pushed a commit to kevinburkesegment/coreutils that referenced this pull request Apr 6, 2026
…#11397)

Co-authored-by: Sylvestre Ledru <sylvestre@debian.org>